1 Backgraound

Ordinary differential equations (ODE) based simulation is a popular choice on modeling real-world systems, including metabolic networks, gene regulatory systems and epidemiology (Battogtokh et al. 2002; Yu et al. 2007; Keeling and Rohani 2008). For the model to be representative, inverse problem often need to be solved: estimating unknown parameters or links from experimental measurements. While ODE can often capture the real-world dynamics, solving the inverse problem is often hard, involving considerable ODE simulations and nonlinear optimization (Battogtokh et al. 2002).

Inference of model parameters is a nonlinear optimization problem: evaluating model prediction with measurements by objective function, and sampling a model of better performance (Figure 1.1 ). This process requires considerable iterations of producing and evaluating different models. A simplified model of central metabolism involves 32 equations and ~200 parameters and is fit with ~1000 data points. A Markov chain Monte Carlo (MCMC) ( 6.1 ) algorithm (Battogtokh et al. 2002) needs around \(10^7\) ODE simulations. Even with modern ODE solver (Adaptive Backward Differentiation), it still takes about a week to finish. When the parameter set is stiff, considerably more time is needed, if not impossible. Previously, we construct a system to systematically ignore stiff parameter sets (pearc2019). This reduces the simulation time cost, while at the same time unavoidably introduce biases.

Figure 1.1: Inverse problem diagram.

While ODE solver has adaptive time cost, neural network has relative constant cost, which is also often small. There has been success in simulating dynamic system by neural network (Breen et al. 2020; Lutter, Ritter, and Peters 2019; Greydanus, Dzamba, and Yosinski 2019) but a high dimension initial value problem hasn't been carefully studied yet. In initial value problem, the input is parameters of the assumed model and the output is the time dynamics. This is often the inner loop of the inverse problem. In this project, we explored the possibility of replacing ODE solver with trained neural networks for initial value problem. The trained neural network can later be used in solving inverse problems.

2 Methods

Dynamics (\(Y(t)\)) of an ODE system solely depends on initial conditions (\(Y_{ini}\)), model parameters (\(\theta\)), and time (\(t\)). In both ODE solver and neural network, we can treat the vector \(X=[\theta, Y_{ini}, t]\) as the input and the vector \(Y(t)\) as the output. Simulation of such a system by ODE solver can produce pairs \([X, Y(t)]\) as training data.

2.1 Neural network architecture

ResNet has been a popular structure in multiple task including image classification and it simplifies the training process by identity matching (He et al. 2016). In our modified architecture, we replaced the convolutional layer with linear layer. For example, resnet18_mlp is similar to original resnet18 (He et al. 2016), except that we used nn.Linear to replace nn.Conv2d (find more details here and the following code block). The hidden layer size was controlled by its ratio (\(R\)) to the input layer size so, network complexity was automatically adjusted in dynamic systems of different size. Batch normalization and RELU (rectified linear unit) were used in the architecture (Ioffe and Szegedy 2015). Adam was used as the optimizer (Kingma and Ba 2015). Hyperparameter tuning was done both manually and by Optuna (Akiba et al. 2019). Mean squared error (MSE) was the objective function. 20% of samples are leaved out for testing. Besides ResNet, multilayer perceptron (MLP) and recurrent structures were also tested. The neural network was implemented in PyTorch and shared in GitHub.

class BasicBlock(nn.Module):
    expansion=1
    __constants__=['downsample']

    def __init__(self,inplanes,planes,downsample=None,groups=1,
                 base_width=64,norm_layer=None,p=0.0):
        # inplanes: input size
        # planes: internal size
        # downsample: whehter downsample the idnetify mapping. default: None
        # groups: was used for "Aggregated Residual Transformation" (not used currently) Defualt 1
        # width_per_group: used for "Wide Residual Networks" and "Aggregated Residual Transformation" Defualt 64
        # norm_layer: used to specify batch normalization function. Default None
        # p: dropbout probability Default 0.0
        super(BasicBlock,self).__init__()
        if norm_layer is None:
            norm_layer=nn.BatchNorm1d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        self.fc1=line1d(inplanes,planes)
        self.bn1=norm_layer(planes)
        self.relu=nn.ReLU(inplace=True)
        self.fc2=line1d(planes,inplanes)
        self.bn2=norm_layer(inplanes)
        self.downsample=downsample
        self.p=p

    def forward(self,x):
        identity=x
        out=F.dropout(self.fc1(x),training=self.training,p=self.p)
        out=self.bn1(out)
        out=self.relu(out)
        out=F.dropout(self.fc2(out),training=self.training,p=self.p)
        out=self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out=self.relu(out)
        return out

The model is trained on P100 GPUs of both sapelo2 at GACRC and PSC Bridges at XSEDE.

2.2 Training data simulation

We simulated dynamic systems of different complexities as training samples. The coupled linear ODE system is simulated with different dimensions (4-32) (2.1), fixed topology, and 10000 random generated parameter sets. We shifted the eigen value to be negative in real part to resemble most real-world systems. For linear ODE, we produced analytic solutions without ODE solver and list detailed procedures in 6.2. Simulated dynamics were normalized to simplify result interpretation: \(\widetilde{Y_n(t_m)} = \frac{Y_n(t_m)}{\sqrt{\Omega_{n}}}\) and \(\Omega_{n}=\sum^{M}_{m=1}{|Y_n(t_m)|^2}\).

Table 2.1: Dimensions of simulated data set. The equations are simualted in complex number so, here number of Ys (real and imaginary) are double the dimension of the system. Refer to 6.2 for details in simulation
dimension of the system number of theta number of Ys number of inputs number of outputs
2 6 4 11 4
3 10 6 17 6
4 14 8 23 8
8 30 16 47 16
12 54 24 79 24
16 74 32 107 32

3 result

We have successfully approximated solutions for low dimension ODE. For ODE system with six equations, the neural network can achieve MSE of 0.2 for both training and testing set after 500 epochs (Figure 3.1) Some representative cases of training and testing samples are presented (Figure 3.2). As similar performance exists for training and testing sets, further training and hyperparameter tuning can probably improve the results. From Figure 3.2 we can also see that different dynamics can be approximated. Similar performance can also be achieved by MLP.

Figure 3.1: MSE through epochs for multiple training settings. Preliminary training results are presented. Solid (dotted) line represent the MSE on training (testing) set. The best models is based on ResNet18, have hidden layer size eight times that of the input layer and use Adam optimizer. Details on training process can be found in Methods 2.2.

Figure 3.2: Fitting results of preliminary trained model. The first (second) row is the model performance in the training (testing) set. X-axis is time and Y-axis is normalized \(Y\). Two cases are presented for both training and testing set. The blue curve is the ground truth (simulation) and the orange curve is neural network prediction. Details in neural network training can be found in Methods 2.2.

The neural network performances decay with higher dimensions of the ODE set. Specifically, for ODE system with 32 equations, the model performance is still not good after extensive hyperparameter tuning (including by Optuna (Akiba et al. 2019)) and trying different technical options. We have tried multiple options of scheduler, recurrent architectures, and optimizer. We also tried using larger training sample. Among these trials, ReduceLROnPlateau scheduler, Adam optimizer, and larger sample size helped.

4 Next steps

  1. Training the neural network on larger data sets. Efficient approaches are needed for simulating and training in large data sets. Both processes will need parallization. A separate data loading process is also necessary because of the memory cost.

  2. Data augmentation is needed to promote training. Some new transformation of time dynamics has been implemented and is currently under testing.

5 Aknowledgement

We thank GACRC and XSEDE for computational resources and technical supports.

6 FAQ

6.1 Why is MCMC used instead of other nonlinear optimizer?

MCMC is often used to generate random samples from complex distributions. By including MSE in transition probability, MCMC can converge to a local fit region in the parameter space. This local region contains many parameter sets (the model ensemble) and each sampled model from MCMC will have similar MSE on the experimental measurements. An ensemble of similar fitted models gives a natural uncertainty estimation for the fitting. This is particularly crucial when less training data is available compared with the model complexity.

6.2 The detailed procedure for simulating linear ODE

The procedure is to compute the ODE \(\frac{dY(t)}{dt}=HY(t)\) and format the input and output for neural network. There is \(N\) dimensions (equations of complex number) and \(M\) time points.

  1. Random select non-zero elements (NE) for \(H\) matrix. Diagonal elements are always included and they represent the contribution of a value to its own derivative. This topology (location of non-zero elements) is fixed for the following simulations and recorded in \(I_{conn}\)

  2. Generate \(\widetilde{H}\)

    For each NE, sample for real and imaginary parts from \(Unif(-1,1)\)

  3. Eigen value shifting

    Decomposition \(\widetilde{H}=U'D'V'\) and \(d'=diag(D')\)

    \(\Delta{d'}=|max(real(d'))|\)

    \(H=\widetilde{H}-S \Delta d' I\) where \(S=1.01\) to make sure all eigen values haave negative real part and so the dynamic system doesn't diverge.

    Decomposition \(H=UDV\)

  4. Generate \(Y_{ini}\)

    Sample real and imaginary parts from \(Unif(-1,1)\)

  5. Generate time series for time grid \(\hat{t}=[t_1, ..., t_M]\)

    \(A=VY_{ini}\)

    \(B=[u_1 a_1, ... u_k a_k, ..., u_Na_N]=U \ diag(A)\), \(u_k\) is the k-th column vector of U

    \(E=[e^{\hat{d} \hat{t}} ]\) where \(\hat{d}=diag(D)\)

    \(Y(t)=BE\)

  6. Rescale time series

    For every equation (\(n\)) in \(N\), the normalized result is

    \(\widetilde{Y_n(t_m)} = \frac{Y_n(t_m)}{\sqrt{\Omega_{n}}}\) where \(\Omega_{n}=\sum^{M}_{m=1}{|Y_n(t_m)|^2}\)

  7. Formulate input and output for neural network

    Input: \([Re(H(I_{conn})), Im(H(I_{conn})), Re(Y_{ini}), Im(Y_{ini}),\hat{t}]\)

    Output: \([Re(Y(t)), Im(Y(t)]\)

Reference

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage, AK, USA, 2623–31. doi:10.1145/3292500.3330701.

Battogtokh, D., D. K. Asch, M. E. Case, J. Arnold, and H. B. Schuttler. 2002. “An Ensemble Method for Identifying Regulatory Circuits with Special Reference to the Qa Gene Cluster of Neurospora Crassa.” Proc Natl Acad Sci U S A 99 (26): 16904–9. doi:10.1073/pnas.262658899.

Breen, P. G., C. N. Foley, T. Boekholt, and S. P. Zwart. 2020. “Newton Versus the Machine: Solving the Chaotic Three-Body Problem Using Deep Neural Networks.” Monthly Notices of the Royal Astronomical Society 494 (2): 2465–70. doi:10.1093/mnras/staa713.

Greydanus, Samuel, Misko Dzamba, and Jason Yosinski. 2019. “Hamiltonian Neural Networks,” 15379–89. http://papers.nips.cc/paper/9672-hamiltonian-neural-networks.pdf.

He, K. M., X. Y. Zhang, S. Q. Ren, and J. Sun. 2016. “Deep Residual Learning for Image Recognition.” 2016 Ieee Conference on Computer Vision and Pattern Recognition (Cvpr), 770–78. doi:10.1109/Cvpr.2016.90.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” Proceedings of the 32nd International Conference on Machine Learning. Lille, France, 448–56.

Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980.

Lutter, Michael, Christian Ritter, and Jan Peters. 2019. “Deep Lagrangian Networks: Using Physics as Model Prior for Deep Learning,” arXiv:1907.04490. https://ui.adsabs.harvard.edu/abs/2019arXiv190704490L.

Yu, Y., W. Dong, C. Altimus, X. Tang, J. Griffith, M. Morello, L. Dudek, J. Arnold, and H. B. Schuttler. 2007. “A Genetic Network for the Clock of Neurospora Crassa.” Proc Natl Acad Sci U S A 104 (8): 2809–14. doi:10.1073/pnas.0611005104.